Identifying duplicate documents from search results without comparing document content

EW Brown, JM Prager - US Patent 5,913,208, 1999 - Google Patents

... 15,1999 [54] IDENTIFYING DUPLICATE DOCUMENTS FROM SEARCH RESULTS WITHOUT COMPARING 
DOCUMENT CONTENT [75] Inventors: Eric William Brown, New Fairfield, Conn ... 



Method and system for detecting duplicate documents in web crawls 
D Meyerzon, S Shoroff, FS Terek, S Norin - US Patent 6,547,829, 2003 - Google Patents
... The Web crawler program 200 may retrieve electronic document information for ... Detecting 
Duplicate Documents Using Content Identifiers As mentioned above ... 
[PDF] On the evolution of clusters of near-duplicate web pages

D Fetterly, M Manasse, M Najork - Proceedings of the 1st Latin American Web Congress, 2003 - cwr.cl

... near-duplicates of the 13,283,856 "canonical" documents representing the ...



Understanding Content Reuse on the Web: Static and Dynamic Analyses

R Baeza-Yates, A Pereira, N Ziviani - Lecture Notes in Computer Science, 2008 - Springer
... In order to design a Web crawler, many different aspects must be ... new document matches 
all the previous conditions, any duplicate of this document cannot be ... 
Managing duplicates in a web archive

D Gomes, AL Santos, MJ Silva - Proceedings of the 2008 ACM symposium on Applied computing, 2008 - portal.acm.org
... reducing the probability of an incremental web crawler downloading a ... on one of the 
volumes, the document is considered to be a duplicate and its ... 




http://scholar.google.com/scholar?hl=en&q=duplicate+document+webcrawler+id+OR+identifier+content&spell=1 (1 of 3) 7/6/2009 5:30:34 PM



duplicate document webcrawler id OR identifier content - Google Scholar 



OverCite: A distributed, cooperative CiteSeer

J Stribling, J Li, IG Councill, MP Kaashoek, R ... - Proc. NSDI, 2006 - usenix.org

... 3.4 Web Crawler. ... eg, title, authors, citations, etc.) as well as the bare ASCII text 
Understanding Content Reuse on the Web: Static and Dynamic Analyses
S Barcelona - Advances in Web Mining and Web Usage Analysis: 8th ..., 2007 - books.google.com
... In order to design a Web crawler, many different aspects must be ... new document matches 
all the previous conditions, any duplicate of this document cannot be ... 
System and method for classifying electronically posted documents
AWL Huang, N Sundaresan - US Patent App. 11/526,470, 2006 - Google Patents
... search engine typically uses proprietary webcrawler and indexing ... to be duplicates, 
the duplicate meta-data ... module reads the downloaded document and generates ...



OverCite: A cooperative digital research library
J Stribling, IG Councill, J Li, MP Kaashoek, DR ... - Lecture Notes in Computer Science, 2005 - Springer
... peer-to-peer search optimizations [23,4]. Web crawler. ... If the document is not a 
duplicate, the crawler ... PC which papers to read (using the document-alert feature ... 
Merging techniques for performing data fusion on the web
T Tsikrika, M Lalmas - Proceedings of the tenth international conference on ..., 2001 - portal.acm.org
... com) and Webcrawler (http://www.webcrawler.com), selected ... set to 30, since the more 
documents retrieved, the ... an increase on the number of duplicate documents is ... 